An edit operation-based approach to approximate string matching in large DNA databases

Authors

  • Jiun-Rung Chen
  • Ye-In Chang
Abstract

In DNA-related research, mutations, defined as heritable changes in the DNA sequence, occur frequently under various environmental conditions. Approximate string matching is therefore used to answer queries that search for mutations. The problem is as follows: given a user-specified parameter k, find where substrings that differ from the query sequence by at most k errors occur in the database sequences. In this paper, we use a new index structure, EII, to support the proposed method, EOB. In the EII structure, we map each overlapping q-gram of the database sequence to an index key and record the positions at which the q-gram occurs in the corresponding index entry. In the EOB method, we first generate all possible mutations of each gram in the query sequence. Then, using the information recorded in the EII structure, we check both the local order (i.e., the order of characters within a gram) and the global order (i.e., the order of grams within an interval) of these mutations. The final answers can be determined directly, without the dynamic programming used in traditional filter methods for approximate string matching. Experimental results show that, for short query sequences, our method can outperform the (k + s) q-samples filter, a well-known method for approximate string matching, in processing time under various conditions.
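The two building blocks named in the abstract, an index over overlapping q-grams and the generation of mutations for a gram, can be illustrated with a minimal Python sketch. This is not the EII structure or the EOB method themselves, whose details the abstract does not give: the function names (`build_qgram_index`, `one_edit_variants`, `candidate_positions`), the four-letter alphabet, and the restriction to a single edit operation per gram are all assumptions made for illustration, and the local and global order checks are omitted.

```python
from collections import defaultdict

ALPHABET = "ACGT"  # assumed DNA alphabet; not specified in the abstract


def build_qgram_index(sequence, q):
    """Map each overlapping q-gram to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - q + 1):
        index[sequence[pos:pos + q]].append(pos)
    return dict(index)


def one_edit_variants(gram):
    """Return every string within one edit operation of gram (gram included)."""
    variants = {gram}
    for i in range(len(gram) + 1):
        for c in ALPHABET:
            variants.add(gram[:i] + c + gram[i:])  # insertion before index i
    for i in range(len(gram)):
        variants.add(gram[:i] + gram[i + 1:])  # deletion of character i
        for c in ALPHABET:
            variants.add(gram[:i] + c + gram[i + 1:])  # substitution at i
    return variants


def candidate_positions(index, gram):
    """Look up each variant of gram that occurs as a key in the index."""
    return {v: index[v] for v in one_edit_variants(gram) if v in index}


index = build_qgram_index("ACGTACGA", 3)
print(index["ACG"])
print(candidate_positions(index, "ACT"))
```

Because this hypothetical index stores only grams of length q, the insertion and deletion variants of a query gram never match a key here; a fuller treatment would have to account for them when comparing positions across neighboring grams.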


Similar resources

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...


Approximate String Joins in a Database (Almost) for Free

String data is ubiquitous, and its management has taken on particular importance in the past few years. Approximate queries are very important on string data especially for more complex queries involving joins. This is due, for example, to the prevalence of typographical errors in data, and multiple conventions for recording attributes such as name and address. Commercial databases do not suppo...


Bit-Parallel Approximate String Matching Algorithms with Transposition

Using bit-parallelism has resulted in fast and practical algorithms for approximate string matching under the Levenshtein edit distance, which permits a single edit operation to insert, delete or substitute a character. Depending on the parameters of the search, currently the fastest non-filtering algorithms in practice are the O(k⌈m/w⌉n) algorithm of Wu & Manber, the O(⌈km/w⌉n) algorithm of Ba...


Comparison of s-gram Proximity Measures in Out-of-Vocabulary Word Translation

Classified s-grams have been successfully used in cross-language information retrieval (CLIR) as an approximate string matching technique for translating out-of-vocabulary (OOV) words. For example, s-grams have consistently outperformed other approximate string matching techniques, like edit distance or n-grams. The Jaccard coefficient has traditionally been used as an s-gram based string proxi...


Approximate String Matching in LDAP Based on Edit Distance

As e-commerce grows rapidly, data search has become necessary in almost every application. Approximate string matching plays a very important role in searching with errors. "Edit distance" and "Soundex" are two common techniques for this problem; the latter is a "sound-like" method and has been applied to the LDAP server. Nevertheless, it is not adequate for certain ...




Publication date: 2003